Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

%matplotlib inline

file_path = 'Ethereum Merged Data.csv'
ethereum_data = pd.read_csv(file_path)

ethereum_data['Timestamp'] = pd.to_datetime(ethereum_data['Timestamp'])

numeric_columns = ['Price', 'Volume', 'Market Cap']  # specify the numeric columns
ethereum_data[numeric_columns] = ethereum_data[numeric_columns].apply(pd.to_numeric, errors='coerce')

ethereum_data.fillna(ethereum_data.median(), inplace=True)

z_scores = stats.zscore(ethereum_data[numeric_columns], nan_policy='omit')
outliers = np.abs(z_scores) > 3

print("Outliers detected:\n", ethereum_data[outliers.any(axis=1)])
Outliers detected:
       Timestamp        Price        Volume    Market Cap
1786 2020-09-11   367.638929  7.474742e+10  4.130033e+10
1901 2021-01-04   967.000597  1.409065e+11  1.125254e+11
1902 2021-01-05  1025.654768  6.228514e+10  1.166932e+11
1903 2021-01-06  1103.358252  4.714825e+10  1.251129e+11
1904 2021-01-07  1208.575093  4.788685e+10  1.373068e+11
...         ...          ...           ...           ...
2240 2021-12-09  4431.540647  1.966195e+10  5.258504e+11
2394 2022-05-12  2080.910244  4.654801e+10  2.500668e+11
2427 2022-06-14  1205.595286  4.757173e+10  1.460727e+11
2692 2023-03-06  1563.225662  6.217285e+10  1.883375e+11
2700 2023-03-14  1678.915634  6.521171e+10  2.022942e+11

[85 rows x 4 columns]
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\4234391317.py:17: FutureWarning: DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version.
  ethereum_data.fillna(ethereum_data.median(), inplace=True)
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Importing Necessary Libraries¶

The following libraries are essential for data manipulation, visualization, and statistical analysis:

  • pandas: For data manipulation and ingestion.
  • numpy: Provides support for efficient numerical computation.
  • matplotlib.pyplot and seaborn: For plotting graphs that are visually appealing.
  • scipy.stats: For statistical functions.

%matplotlib inline is a magic function that renders the figure in a notebook (instead of displaying a dump of the figure object).

Loading and Preparing Data¶

  • file_path variable holds the location of the dataset.
  • ethereum_data = pd.read_csv(file_path): Reads the Ethereum dataset into a DataFrame.
  • Converts the 'Timestamp' column to datetime format to enable time-series analysis.

Cleaning and Preprocessing Data¶

  • numeric_columns identifies the columns with numerical data.
  • apply(pd.to_numeric, errors='coerce'): Converts the values in these columns to numeric types, coercing errors by replacing non-numeric values with NaN.
  • Fills missing values using the median of each column which is more robust to outliers than the mean.
  • Note: A future warning indicates changes in handling datetime columns during aggregation functions in upcoming pandas versions.

Detecting Outliers¶

  • Standardizes the numeric columns to have mean zero and unit variance.
  • Calculates the z-scores for each numeric attribute.
  • Considers points with a z-score absolute value greater than 3 as outliers (common rule of thumb).
  • Displays rows in the dataset that contain outliers in any of the specified numeric columns.

  • McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media.

  • VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
In [ ]:
 

Working with the OS Library and Current Working Directory

In [3]:
import os

current_directory = os.getcwd()

print("Current Working Directory:", current_directory)
Current Working Directory: C:\Users\Luke Holmes

Reading and Displaying Data from a CSV File

In [4]:
print(ethereum_data.head())
   Timestamp     Price         Volume    Market Cap
0 2015-10-21  0.439769  599041.013152  3.259030e+07
1 2015-10-22  0.565462  979304.072423  4.191854e+07
2 2015-10-23  0.540738  866798.488854  4.010011e+07
3 2015-10-24  0.568574  259157.662411  4.217907e+07
4 2015-10-25  0.631939  476617.738601  4.689593e+07

Analyzing the Structure and Basic Statistics of Ethereum Data

In [5]:
ethereum_data.shape
ethereum_data.describe
ethereum_data.head
Out[5]:
<bound method NDFrame.head of       Timestamp        Price        Volume    Market Cap
0    2015-10-21     0.439769  5.990410e+05  3.259030e+07
1    2015-10-22     0.565462  9.793041e+05  4.191854e+07
2    2015-10-23     0.540738  8.667985e+05  4.010011e+07
3    2015-10-24     0.568574  2.591577e+05  4.217907e+07
4    2015-10-25     0.631939  4.766177e+05  4.689593e+07
...         ...          ...           ...           ...
2959 2023-11-28  2030.000506  1.922688e+10  2.437204e+11
2960 2023-11-29  2048.535257  1.642457e+10  2.462628e+11
2961 2023-11-30  2025.937328  1.309906e+10  2.440015e+11
2962 2023-12-01  2051.756718  1.162592e+10  2.468482e+11
2963 2023-12-02  2085.712361  1.991308e+10  2.506407e+11

[2964 rows x 4 columns]>
In [6]:
ethereum_data['Timestamp'] = pd.to_datetime(ethereum_data['Timestamp'])

ethereum_data['year'] = ethereum_data['Timestamp'].dt.year
ethereum_data['month'] = ethereum_data['Timestamp'].dt.month
ethereum_data['day'] = ethereum_data['Timestamp'].dt.day
ethereum_data['weekday'] = ethereum_data['Timestamp'].dt.weekday
ethereum_data['hour'] = ethereum_data['Timestamp'].dt.hour  # if hour is relevant

ethereum_data['price_volume_interaction'] = ethereum_data['Price'] * ethereum_data['Volume']

ethereum_data['marketcap_volume_ratio'] = ethereum_data['Market Cap'] / ethereum_data['Volume']
ethereum_data['price_change'] = ethereum_data['Price'].diff() 
ethereum_data['volume_change'] = ethereum_data['Volume'].diff() 

print(ethereum_data.head())
   Timestamp     Price         Volume    Market Cap  year  month  day  \
0 2015-10-21  0.439769  599041.013152  3.259030e+07  2015     10   21   
1 2015-10-22  0.565462  979304.072423  4.191854e+07  2015     10   22   
2 2015-10-23  0.540738  866798.488854  4.010011e+07  2015     10   23   
3 2015-10-24  0.568574  259157.662411  4.217907e+07  2015     10   24   
4 2015-10-25  0.631939  476617.738601  4.689593e+07  2015     10   25   

   weekday  hour  price_volume_interaction  marketcap_volume_ratio  \
0        2     0             263439.651980               54.404118   
1        3     0             553758.756288               42.804418   
2        4     0             468710.679129               46.262318   
3        5     0             147350.248756              162.754481   
4        6     0             301193.246187               98.393162   

   price_change  volume_change  
0           NaN            NaN  
1      0.125693  380263.059271  
2     -0.024724 -112505.583569  
3      0.027836 -607640.826444  
4      0.063365  217460.076191  
In [7]:
ed=ethereum_data
In [8]:
ed.dropna()
(ed.shape)
(ed.head())
(ed.tail())
Out[8]:
Timestamp Price Volume Market Cap year month day weekday hour price_volume_interaction marketcap_volume_ratio price_change volume_change
2959 2023-11-28 2030.000506 1.922688e+10 2.437204e+11 2023 11 28 1 0 3.903058e+13 12.676024 -34.073712 7.151175e+09
2960 2023-11-29 2048.535257 1.642457e+10 2.462628e+11 2023 11 29 2 0 3.364632e+13 14.993557 18.534751 -2.802307e+09
2961 2023-11-30 2025.937328 1.309906e+10 2.440015e+11 2023 11 30 3 0 2.653788e+13 18.627399 -22.597929 -3.325512e+09
2962 2023-12-01 2051.756718 1.162592e+10 2.468482e+11 2023 12 1 4 0 2.385355e+13 21.232583 25.819390 -1.473147e+09
2963 2023-12-02 2085.712361 1.991308e+10 2.506407e+11 2023 12 2 5 0 4.153296e+13 12.586734 33.955643 8.287165e+09
In [9]:
print(ed.isnull().sum())
ed.dtypes
Timestamp                   0
Price                       0
Volume                      0
Market Cap                  0
year                        0
month                       0
day                         0
weekday                     0
hour                        0
price_volume_interaction    0
marketcap_volume_ratio      0
price_change                1
volume_change               1
dtype: int64
Out[9]:
Timestamp                   datetime64[ns]
Price                              float64
Volume                             float64
Market Cap                         float64
year                                 int64
month                                int64
day                                  int64
weekday                              int64
hour                                 int64
price_volume_interaction           float64
marketcap_volume_ratio             float64
price_change                       float64
volume_change                      float64
dtype: object

Data Cleaning

In [10]:
ed.rename(columns={'Timestamp': 'date'}, inplace=True)
print(ed.head())
        date     Price         Volume    Market Cap  year  month  day  \
0 2015-10-21  0.439769  599041.013152  3.259030e+07  2015     10   21   
1 2015-10-22  0.565462  979304.072423  4.191854e+07  2015     10   22   
2 2015-10-23  0.540738  866798.488854  4.010011e+07  2015     10   23   
3 2015-10-24  0.568574  259157.662411  4.217907e+07  2015     10   24   
4 2015-10-25  0.631939  476617.738601  4.689593e+07  2015     10   25   

   weekday  hour  price_volume_interaction  marketcap_volume_ratio  \
0        2     0             263439.651980               54.404118   
1        3     0             553758.756288               42.804418   
2        4     0             468710.679129               46.262318   
3        5     0             147350.248756              162.754481   
4        6     0             301193.246187               98.393162   

   price_change  volume_change  
0           NaN            NaN  
1      0.125693  380263.059271  
2     -0.024724 -112505.583569  
3      0.027836 -607640.826444  
4      0.063365  217460.076191  
In [11]:
ed['year'] = pd.to_datetime(ed['date']).dt.year
ed['month'] = pd.to_datetime(ed['date']).dt.month
ed['day'] = pd.to_datetime(ed['date']).dt.day
ed['weekday'] = pd.to_datetime(ed['date']).dt.weekday
In [ ]:
 

Advanced Feature Engineering for Ethereum Price Prediction

In [12]:
ed['price_7day_avg'] = ed['Price'].rolling(window=7).mean()
ed['volume_7day_avg'] = ed['Volume'].rolling(window=7).mean()
ed['price_change_pct'] = ed['Price'].pct_change() * 100
ed['volume_change_pct'] = ed['Volume'].pct_change() * 100
ed['market_cap_volume_ratio'] = ed['Market Cap'] / ed['Volume']
ed['price_lag1'] = ed['Price'].shift(1)
ed['volume_lag1'] = ed['Volume'].shift(1)
ed['price_ema_short'] = ed['Price'].ewm(span=12, adjust=False).mean()  
ed['price_ema_long'] = ed['Price'].ewm(span=26, adjust=False).mean() 
delta = ed['Price'].diff()
up, down = delta.copy(), delta.copy()
up[up < 0] = 0
down[down > 0] = 0
ed['rsi'] = 100 - (100 / (1 + up.rolling(window=14).mean() / down.abs().rolling(window=14).mean()))
ed['week_of_year'] = pd.to_datetime(ed['date']).dt.isocalendar().week
ed['quarter'] = pd.to_datetime(ed['date']).dt.quarter
ed['days_since_launch'] = (pd.to_datetime(ed['date']) - pd.to_datetime(ed['date']).min()).dt.days
ed['cumulative_return'] = (1 + ed['Price'].pct_change()).cumprod()
ed['cumulative_volume'] = ed['Volume'].cumsum()
ed['price_ma_ratio'] = ed['Price'] / ed['Price'].rolling(window=20).mean()
ed['normalized_price'] = ed['Price'] / ed['Price'].max()
print(ethereum_data.columns)
Index(['date', 'Price', 'Volume', 'Market Cap', 'year', 'month', 'day',
       'weekday', 'hour', 'price_volume_interaction', 'marketcap_volume_ratio',
       'price_change', 'volume_change', 'price_7day_avg', 'volume_7day_avg',
       'price_change_pct', 'volume_change_pct', 'market_cap_volume_ratio',
       'price_lag1', 'volume_lag1', 'price_ema_short', 'price_ema_long', 'rsi',
       'week_of_year', 'quarter', 'days_since_launch', 'cumulative_return',
       'cumulative_volume', 'price_ma_ratio', 'normalized_price'],
      dtype='object')
In [13]:
ethereum_data['price_x_volume'] = ethereum_data['Price'] * ethereum_data['Volume']
ethereum_data['marketcap_per_volume'] = ethereum_data['Market Cap'] / ethereum_data['Volume']
ethereum_data['price_squared'] = ethereum_data['Price'] ** 2
ethereum_data['price_change_pct'] = ethereum_data['Price'].pct_change()
ethereum_data['price_change_pct_x_volume'] = ethereum_data['price_change_pct'] * ethereum_data['Volume']
print(ethereum_data.head())
        date     Price         Volume    Market Cap  year  month  day  \
0 2015-10-21  0.439769  599041.013152  3.259030e+07  2015     10   21   
1 2015-10-22  0.565462  979304.072423  4.191854e+07  2015     10   22   
2 2015-10-23  0.540738  866798.488854  4.010011e+07  2015     10   23   
3 2015-10-24  0.568574  259157.662411  4.217907e+07  2015     10   24   
4 2015-10-25  0.631939  476617.738601  4.689593e+07  2015     10   25   

   weekday  hour  price_volume_interaction  ...  quarter  days_since_launch  \
0        2     0             263439.651980  ...        4                  0   
1        3     0             553758.756288  ...        4                  1   
2        4     0             468710.679129  ...        4                  2   
3        5     0             147350.248756  ...        4                  3   
4        6     0             301193.246187  ...        4                  4   

   cumulative_return  cumulative_volume  price_ma_ratio  normalized_price  \
0                NaN       5.990410e+05             NaN          0.000091   
1           1.285815       1.578345e+06             NaN          0.000117   
2           1.229595       2.445144e+06             NaN          0.000112   
3           1.292892       2.704301e+06             NaN          0.000118   
4           1.436979       3.180919e+06             NaN          0.000131   

   price_x_volume  marketcap_per_volume  price_squared  \
0   263439.651980             54.404118       0.193397   
1   553758.756288             42.804418       0.319747   
2   468710.679129             46.262318       0.292397   
3   147350.248756            162.754481       0.323276   
4   301193.246187             98.393162       0.399347   

   price_change_pct_x_volume  
0                        NaN  
1              279899.710740  
2              -37899.132146  
3               13340.871634  
4               53116.946443  

[5 rows x 34 columns]
In [14]:
ethereum_data['ma_price_x_ma_volume'] = ethereum_data['price_7day_avg'] * ethereum_data['volume_7day_avg']
ethereum_data['rsi_x_price_change_pct'] = ethereum_data['rsi'] * ethereum_data['price_change_pct']
ethereum_data['return_volume_ratio'] = ethereum_data['cumulative_return'] / ethereum_data['cumulative_volume']
print(ethereum_data[['ma_price_x_ma_volume', 'rsi_x_price_change_pct', 'return_volume_ratio']].head())
   ma_price_x_ma_volume  rsi_x_price_change_pct  return_volume_ratio
0                   NaN                     NaN                  NaN
1                   NaN                     NaN         8.146602e-07
2                   NaN                     NaN         5.028723e-07
3                   NaN                     NaN         4.780873e-07
4                   NaN                     NaN         4.517497e-07
In [15]:
ethereum_data['rsi_squared'] = ethereum_data['rsi'] ** 2
ethereum_data['rsi_cubed'] = ethereum_data['rsi'] ** 3
ethereum_data['rsi_squared_x_price'] = ethereum_data['rsi_squared'] * ethereum_data['Price']
In [16]:
ethereum_data['is_Q4'] = ethereum_data['quarter'].apply(lambda x: 1 if x == 4 else 0)
ethereum_data['is_start_of_year'] = ethereum_data['month'].apply(lambda x: 1 if x == 1 else 0)
ethereum_data['Q4_volume_change'] = ethereum_data['is_Q4'] * ethereum_data['volume_change']
In [17]:
high_volume_median = ethereum_data['Volume'].median()
ethereum_data['high_volume_price_change'] = ethereum_data.apply(lambda x: x['price_change'] if x['Volume'] > high_volume_median else 0, axis=1)
In [18]:
ed.head(50)
Out[18]:
date Price Volume Market Cap year month day weekday hour price_volume_interaction ... ma_price_x_ma_volume rsi_x_price_change_pct return_volume_ratio rsi_squared rsi_cubed rsi_squared_x_price is_Q4 is_start_of_year Q4_volume_change high_volume_price_change
0 2015-10-21 0.439769 5.990410e+05 3.259030e+07 2015 10 21 2 0 2.634397e+05 ... NaN NaN NaN NaN NaN NaN 1 0 NaN 0.0
1 2015-10-22 0.565462 9.793041e+05 4.191854e+07 2015 10 22 3 0 5.537588e+05 ... NaN NaN 8.146602e-07 NaN NaN NaN 1 0 3.802631e+05 0.0
2 2015-10-23 0.540738 8.667985e+05 4.010011e+07 2015 10 23 4 0 4.687107e+05 ... NaN NaN 5.028723e-07 NaN NaN NaN 1 0 -1.125056e+05 0.0
3 2015-10-24 0.568574 2.591577e+05 4.217907e+07 2015 10 24 5 0 1.473502e+05 ... NaN NaN 4.780873e-07 NaN NaN NaN 1 0 -6.076408e+05 0.0
4 2015-10-25 0.631939 4.766177e+05 4.689593e+07 2015 10 25 6 0 3.011932e+05 ... NaN NaN 4.517497e-07 NaN NaN NaN 1 0 2.174601e+05 0.0
5 2015-10-26 0.743958 1.174027e+06 5.522699e+07 2015 10 26 0 0 8.734265e+05 ... NaN NaN 3.884553e-07 NaN NaN NaN 1 0 6.974091e+05 0.0
6 2015-10-27 0.854455 1.887569e+06 6.345150e+07 2015 10 27 1 0 1.612843e+06 ... 5.535319e+05 NaN 3.112470e-07 NaN NaN NaN 1 0 7.135421e+05 0.0
7 2015-10-28 1.010410 2.447634e+06 7.505796e+07 2015 10 28 2 0 2.473113e+06 ... 8.116759e+05 NaN 2.643904e-07 NaN NaN NaN 1 0 5.600649e+05 0.0
8 2015-10-29 1.163749 2.236842e+06 8.647898e+07 2015 10 29 3 0 2.603122e+06 ... 1.051975e+06 NaN 2.421776e-07 NaN NaN NaN 1 0 -2.107916e+05 0.0
9 2015-10-30 1.041849 2.384550e+06 7.744784e+07 2015 10 30 4 0 2.484341e+06 ... 1.333891e+06 NaN 1.779721e-07 NaN NaN NaN 1 0 1.477078e+05 0.0
10 2015-10-31 0.907092 6.522716e+05 6.745344e+07 2015 10 31 5 0 5.916701e+05 ... 1.459934e+06 NaN 1.477143e-07 NaN NaN NaN 1 0 -1.732278e+06 0.0
11 2015-11-01 1.058542 6.039962e+05 7.874273e+07 2015 11 1 6 0 6.393553e+05 ... 1.575586e+06 NaN 1.652301e-07 NaN NaN NaN 1 0 -4.827539e+04 0.0
12 2015-11-02 0.955046 9.706657e+05 7.106635e+07 2015 11 2 0 0 9.270302e+05 ... 1.595625e+06 NaN 1.397627e-07 NaN NaN NaN 1 0 3.666695e+05 0.0
13 2015-11-03 1.002345 1.878273e+06 7.461345e+07 2015 11 3 1 0 1.882677e+06 ... 1.628024e+06 NaN 1.308656e-07 NaN NaN NaN 1 0 9.076073e+05 0.0
14 2015-11-04 0.901809 3.218065e+06 6.715251e+07 2015 11 4 2 0 2.902079e+06 ... 1.713799e+06 -6.632193 9.937778e-08 4372.241143 289105.370958 3942.924929 1 0 1.339792e+06 0.0
15 2015-11-05 0.906601 1.202885e+06 6.753250e+07 2015 11 5 3 0 1.090537e+06 ... 1.508190e+06 0.334806 9.440279e-08 3969.137249 250059.970163 3598.424342 1 0 -2.015180e+06 0.0
16 2015-11-06 0.909442 9.136292e+05 6.776739e+07 2015 11 6 4 0 8.308929e+05 ... 1.279356e+06 0.201469 9.089579e-08 4133.197826 265723.086603 3758.904519 1 0 -2.892562e+05 0.0
17 2015-11-07 0.922482 9.046319e+05 6.876115e+07 2015 11 7 5 0 8.345067e+05 ... 1.316602e+06 0.915871 8.867329e-08 4080.096967 260618.791701 3763.816383 1 0 -8.997300e+03 0.0
18 2015-11-08 1.030500 1.040087e+06 7.683930e+07 2015 11 8 6 0 1.071809e+06 ... 1.370045e+06 7.622571 9.488461e-08 4237.709529 275865.110573 4366.957611 1 0 1.354549e+05 0.0
19 2015-11-09 0.995803 1.972521e+06 7.427755e+07 2015 11 9 0 0 1.964243e+06 ... 1.514824e+06 -2.024678 8.490811e-08 3616.057032 217446.743069 3600.880384 1 0 9.324346e+05 0.0
20 2015-11-10 0.934834 8.650315e+05 6.975349e+07 2015 11 10 1 0 8.086611e+05 ... 1.362982e+06 -3.267527 7.720529e-08 2848.199618 152004.216716 2662.594474 1 0 -1.107490e+06 0.0
21 2015-11-11 0.788761 1.243323e+06 5.887388e+07 2015 11 11 2 0 9.806847e+05 ... 1.078152e+06 -6.349221 6.232706e-08 1651.086634 67089.536649 1302.313037 1 0 3.782912e+05 0.0
22 2015-11-12 0.900742 8.268297e+05 6.725489e+07 2015 11 12 3 0 7.447603e+05 ... 1.027427e+06 5.463497 6.918774e-08 1480.964306 56992.392294 1333.966850 1 0 -4.164931e+05 0.0
23 2015-11-13 0.904082 5.450863e+05 6.752726e+07 2015 11 13 4 0 4.928024e+05 ... 9.778608e+05 0.160417 6.818871e-08 1872.153919 81005.093415 1692.579832 1 0 -2.817434e+05 0.0
24 2015-11-14 0.884229 3.615861e+05 6.606671e+07 2015 11 14 5 0 3.197249e+05 ... 9.007257e+05 -1.070312 6.590098e-08 2375.740770 115797.338042 2100.698660 1 0 -1.835001e+05 0.0
25 2015-11-15 0.910826 4.320234e+05 6.807833e+07 2015 11 15 6 0 3.934980e+05 ... 8.055660e+05 1.220412 6.693542e-08 1646.221972 66793.252323 1499.421030 1 0 7.043729e+04 0.0
26 2015-11-16 0.933841 6.163469e+05 6.981283e+07 2015 11 16 0 0 5.755699e+05 ... 6.244834e+05 1.225320 6.728649e-08 2351.466388 114027.121982 2195.895215 1 0 1.843235e+05 0.0
27 2015-11-17 0.995273 1.128250e+06 7.444073e+07 2015 11 17 1 0 1.122917e+06 ... 6.644529e+05 3.256810 6.923762e-08 2450.942894 121338.826016 2439.358164 1 0 5.119026e+05 0.0
28 2015-11-18 0.994429 6.872033e+05 7.440272e+07 2015 11 18 2 0 6.833746e+05 ... 6.120467e+05 -0.048805 6.775441e-08 3306.214627 190106.324049 3287.794316 1 0 -4.410462e+05 0.0
29 2015-11-19 0.951471 4.343855e+05 7.121230e+07 2015 11 19 3 0 4.133053e+05 ... 5.641534e+05 -2.307712 6.399463e-08 2853.885945 152459.650199 2715.390540 1 0 -2.528179e+05 0.0
30 2015-11-20 0.926803 6.065603e+05 6.938837e+07 2015 11 20 4 0 5.621620e+05 ... 5.743795e+05 -1.329536 6.123683e-08 2629.769695 134857.956504 2437.278756 1 0 1.721749e+05 0.0
31 2015-11-21 0.973572 4.549269e+05 7.291437e+07 2015 11 21 5 0 4.429043e+05 ... 5.948952e+05 2.704405 6.348780e-08 2872.078468 153919.786275 2796.176426 1 0 -1.516334e+05 0.0
32 2015-11-22 0.964739 3.719629e+05 7.227634e+07 2015 11 22 6 0 3.588469e+05 ... 5.914305e+05 -0.404938 6.224772e-08 1991.625162 88881.506412 1921.397536 1 0 -8.296404e+04 0.0
33 2015-11-23 0.945220 4.371264e+05 7.083920e+07 2015 11 23 0 0 4.131808e+05 ... 5.677350e+05 -0.925849 6.024115e-08 2094.210005 95836.367674 1979.489885 1 0 6.516351e+04 0.0
34 2015-11-24 0.899361 3.549141e+05 6.742430e+07 2015 11 24 1 0 3.191959e+05 ... 4.546287e+05 -2.277925 5.675388e-08 2204.408532 103499.469795 1982.559361 1 0 -8.221231e+04 0.0
35 2015-11-25 0.867707 7.088672e+05 6.507285e+07 2015 11 25 2 0 6.150891e+05 ... 4.488592e+05 -2.057096 5.369997e-08 3416.007675 199654.110874 2964.094085 1 0 3.539531e+05 0.0
36 2015-11-26 0.894875 9.596965e+05 6.713206e+07 2015 11 26 3 0 8.588082e+05 ... 5.143551e+05 1.541474 5.397161e-08 2423.893373 119335.667578 2169.081027 1 0 2.508293e+05 0.0
37 2015-11-27 0.870713 3.985955e+05 6.534185e+07 2015 11 27 4 0 3.470622e+05 ... 4.826661e+05 -1.238329 5.196498e-08 2103.447781 96471.182592 1831.498907 1 0 -5.611010e+05 0.0
38 2015-11-28 0.917399 4.650649e+05 6.886877e+07 2015 11 28 5 0 4.266500e+05 ... 4.797563e+05 2.887629 5.409101e-08 2900.412988 156203.140755 2660.835324 1 0 6.646944e+04 0.0
39 2015-11-29 0.871745 4.369391e+05 6.546334e+07 2015 11 29 6 0 3.808995e+05 ... 4.810518e+05 -2.271756 5.082339e-08 2083.940110 95132.267949 1816.664275 1 0 -2.812581e+04 0.0
40 2015-11-30 0.873601 7.753725e+05 6.562485e+07 2015 11 30 0 0 6.773662e+05 ... 5.183211e+05 0.091471 4.993883e-08 1845.876779 79305.637343 1612.559675 1 0 3.384334e+05 0.0
41 2015-12-01 0.875004 6.431099e+05 6.575249e+07 2015 12 1 1 0 5.627237e+05 ... 5.525786e+05 0.054061 4.922323e-08 1133.031732 38138.456233 991.407295 1 0 -1.322627e+05 0.0
42 2015-12-02 0.822734 4.942059e+05 6.184478e+07 2015 12 2 2 0 4.065999e+05 ... 5.217142e+05 -1.764262 4.572375e-08 872.243691 25760.646354 717.624350 1 0 -1.489040e+05 0.0
43 2015-12-03 0.824765 5.416811e+05 6.201881e+07 2015 12 3 3 0 4.467597e+05 ... 4.640805e+05 0.082131 4.523775e-08 1106.462467 36804.848095 912.571725 1 0 4.747524e+04 0.0
44 2015-12-04 0.838791 2.419475e+05 6.309433e+07 2015 12 4 4 0 2.029434e+05 ... 4.423760e+05 0.646870 4.574011e-08 1446.909749 55037.939230 1213.654805 1 0 -2.997336e+05 0.0
45 2015-12-05 0.864584 2.260844e+05 6.505634e+07 2015 12 5 5 0 1.954689e+05 ... 4.093750e+05 1.054467 4.689238e-08 1175.919488 40324.257731 1016.680857 1 0 -1.586308e+04 0.0
46 2015-12-06 0.834992 4.296813e+05 6.285026e+07 2015 12 6 6 0 3.587806e+05 ... 4.059763e+05 -1.107406 4.482802e-08 1046.891522 33872.911238 874.146555 1 0 2.035969e+05 0.0
47 2015-12-07 0.800750 4.946576e+05 6.029248e+07 2015 12 7 0 0 3.960972e+05 ... 3.674121e+05 -1.275784 4.249339e-08 967.823226 30108.842548 774.984723 1 0 6.497633e+04 0.0
48 2015-12-08 0.818853 4.359668e+05 6.167586e+07 2015 12 8 1 0 3.569926e+05 ... 3.393504e+05 0.873737 4.301638e-08 1493.762873 57732.782770 1223.171775 1 0 -5.869081e+04 0.0
49 2015-12-09 0.791829 6.271289e+05 5.966107e+07 2015 12 9 2 0 4.965791e+05 ... 3.532086e+05 -1.292357 4.100272e-08 1533.549977 60054.686047 1214.309879 1 0 1.911621e+05 0.0

50 rows × 44 columns

Advanced Feature Engineering for Ethereum Price Prediction¶

This section of the code demonstrates the creation of various advanced features derived from the basic columns in the ethereum_data DataFrame. These features aim to capture trends, cyclic behavior, and other complex relationships within the data that could be useful for predicting Ethereum prices.

```python

Creating rolling averages for price and volume over a 7-day window to smooth out short-term fluctuations¶

ed['price_7day_avg'] = ed['Price'].rolling(window=7).mean() ed['volume_7day_avg'] = ed['Volume'].rolling(window=7).mean()

Calculating the percentage change in price and volume to capture momentum¶

ed['price_change_pct'] = ed['Price'].pct_change() 100 ed['volume_change_pct'] = ed['Volume'].pct_change() 100

Ratio of market capitalization to volume to understand the trading activity relative to the size of the market¶

ed['market_cap_volume_ratio'] = ed['Market Cap'] / ed['Volume']

Lag features to capture previous day's values which might indicate persistence in price and volume¶

ed['price_lag1'] = ed['Price'].shift(1) ed['volume_lag1'] = ed['Volume'].shift(1)

Exponential moving averages for price to identify long-term trends versus short-term trends¶

ed['price_ema_short'] = ed['Price'].ewm(span=12, adjust=False).mean()
ed['price_ema_long'] = ed['Price'].ewm(span=26, adjust=False).mean()

Computing the Relative Strength Index (RSI), a popular momentum indicator¶

delta = ed['Price'].diff() up, down = delta.copy(), delta.copy() up[up < 0] = 0 down[down > 0] = 0 ed['rsi'] = 100 - (100 / (1 + up.rolling(window=14).mean() / down.abs().rolling(window=14).mean()))

Extracting time-based features to capture seasonal effects¶

ed['week_of_year'] = pd.to_datetime(ed['date']).dt.isocalendar().week ed['quarter'] = pd.to_datetime(ed['date']).dt.quarter ed['days_since_launch'] = (pd.to_datetime(ed['date']) - pd.to_datetime(ed['date']).min()).dt.days

Cumulative measures to see overall growth trends¶

ed['cumulative_return'] = (1 + ed['Price'].pct_change()).cumprod() ed['cumulative_volume'] = ed['Volume'].cumsum()

Price to moving average ratio to understand price position relative to recent trends¶

ed['price_ma_ratio'] = ed['Price'] / ed['Price'].rolling(window=20).mean()

Normalizing price by dividing by the maximum price to make the range between 0 and 1¶

ed['normalized_price'] = ed['Price'] / ed['Price'].max()

print(ethereum_data.columns)

LTSM Model

In [23]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
ethereum_data['date'] = pd.to_datetime(ethereum_data['date'])
ethereum_data.sort_values('date', inplace=True)
features = ethereum_data[['Price', 'Volume', 'Market Cap']]
target = ethereum_data['Price']
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_features = scaler.fit_transform(features)
scaled_target = scaler.fit_transform(target.values.reshape(-1, 1))

def create_dataset(X, y, time_step=1):
    Xs, ys = [], []
    for i in range(len(X) - time_step):
        v = X[i:(i + time_step)]
        Xs.append(v)
        ys.append(y[i + time_step])
    return np.array(Xs), np.array(ys)

time_step = 10
X, y = create_dataset(scaled_features, scaled_target, time_step)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
lstm_units = 50
dropout_rate = 0.2

model = Sequential()
model.add(LSTM(units=lstm_units, return_sequences=True, input_shape=(time_step, X.shape[2])))
model.add(LSTM(units=lstm_units))
model.add(Dropout(rate=dropout_rate))
model.add(Dense(units=1))

model.compile(loss='mean_squared_error', optimizer='adam')

model.summary()
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ lstm_2 (LSTM)                   │ (None, 10, 50)         │        10,800 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_3 (LSTM)                   │ (None, 50)             │        20,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_1 (Dropout)             │ (None, 50)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 1)              │            51 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 31,051 (121.29 KB)
 Trainable params: 31,051 (121.29 KB)
 Non-trainable params: 0 (0.00 B)

numpy and pandas are used for data manipulation. MinMaxScaler from sklearn.preprocessing is used to scale the data, which helps in normalizing the input features/labels within a bounded range and is generally a good practice for neural network algorithms. Sequential from tensorflow.keras.models and layers like LSTM, Dense, and Dropout from tensorflow.keras.layers are used to build the LSTM model.

In [ ]:
ethereum_data['date'] = pd.to_datetime(ethereum_data['date'])
ethereum_data.sort_values('date', inplace=True)
features = ethereum_data[['Price', 'Volume', 'Market Cap']]
target = ethereum_data['Price']
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_features = scaler.fit_transform(features)
scaled_target = scaler.fit_transform(target.values.reshape(-1, 1))

Converts the 'date' column to a datetime object and sorts the DataFrame based on date to ensure that the sequence is in chronological order. Extracts relevant features and the target variable ('Price') for model training. Scales the features and target variable between 0 and 1 to facilitate faster convergence during training.

In [24]:
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(time_step, X.shape[2])))
model.add(LSTM(50))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

model.summary()
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ lstm_4 (LSTM)                   │ (None, 10, 50)         │        10,800 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm_5 (LSTM)                   │ (None, 50)             │        20,200 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 50)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 1)              │            51 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 31,051 (121.29 KB)
 Trainable params: 31,051 (121.29 KB)
 Non-trainable params: 0 (0.00 B)

Splits the data into training and testing sets. Constructs a Sequential LSTM model with two LSTM layers to capture complex relationships in the data. A Dropout layer is included to prevent overfitting, followed by a Dense output layer to predict the continuous value (Ethereum price). The model uses 'mean_squared_error' as the loss function and 'adam' optimizer, which is commonly used in regression problems like this.

In [25]:
from optuna import Trial
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense, Input
from tensorflow.keras.optimizers import Adam

def build_model(trial: Trial):
    tf.keras.backend.clear_session()
    model = Sequential([
        Input(shape=(time_step, X.shape[2])),
        LSTM(trial.suggest_categorical('lstm_units', [50, 100, 150]), return_sequences=True),
        LSTM(trial.suggest_categorical('lstm_units', [50, 100, 150])),
        Dropout(trial.suggest_float('dropout_rate', 0.1, 0.5)),
        Dense(1)
    ])
    lr = trial.suggest_float('lr', 1e-5, 1e-1, log=True)
    model.compile(optimizer=Adam(learning_rate=lr), loss='mean_squared_error')
    return model

def objective(trial: Trial):
    model = build_model(trial)
    history = model.fit(X_train, y_train, validation_split=0.2, epochs=50, batch_size=64, verbose=0)
    loss = model.evaluate(X_test, y_test, verbose=0)
    return loss

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50, timeout=600)

print("Best trial:")
trial = study.best_trial
print(f"Value: {trial.value}")
print("Params: ")
for key, value in trial.params.items():
    print(f"{key}: {value}")
[I 2024-05-01 22:21:15,208] A new study created in memory with name: no-name-ecf34e47-17a2-40ab-a0e6-17ff04acab20
[I 2024-05-01 22:22:32,664] Trial 0 finished with value: 0.0007128661382012069 and parameters: {'lstm_units': 150, 'dropout_rate': 0.3900353930656907, 'lr': 0.006665058671399883}. Best is trial 0 with value: 0.0007128661382012069.
[I 2024-05-01 22:23:38,356] Trial 1 finished with value: 0.0005467506707645953 and parameters: {'lstm_units': 100, 'dropout_rate': 0.48366893947431777, 'lr': 0.00017547692320694214}. Best is trial 1 with value: 0.0005467506707645953.
[I 2024-05-01 22:24:22,317] Trial 2 finished with value: 0.02268524281680584 and parameters: {'lstm_units': 100, 'dropout_rate': 0.12746841967209088, 'lr': 0.04212107130866591}. Best is trial 1 with value: 0.0005467506707645953.
[I 2024-05-01 22:25:17,107] Trial 3 finished with value: 0.002532100537791848 and parameters: {'lstm_units': 100, 'dropout_rate': 0.1736868828938726, 'lr': 1.1849615328748276e-05}. Best is trial 1 with value: 0.0005467506707645953.
[I 2024-05-01 22:26:05,849] Trial 4 finished with value: 0.0018641131464391947 and parameters: {'lstm_units': 100, 'dropout_rate': 0.1397727728687151, 'lr': 2.2219403053770703e-05}. Best is trial 1 with value: 0.0005467506707645953.
[I 2024-05-01 22:26:52,823] Trial 5 finished with value: 0.0005035395734012127 and parameters: {'lstm_units': 100, 'dropout_rate': 0.3717713526400662, 'lr': 0.0003587859917045765}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:27:14,318] Trial 6 finished with value: 0.013612092472612858 and parameters: {'lstm_units': 50, 'dropout_rate': 0.19570302507032103, 'lr': 0.054390304448723954}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:27:35,364] Trial 7 finished with value: 0.0006943660555407405 and parameters: {'lstm_units': 50, 'dropout_rate': 0.42498911739570355, 'lr': 0.00024626257778208375}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:28:49,605] Trial 8 finished with value: 0.00172930839471519 and parameters: {'lstm_units': 150, 'dropout_rate': 0.12686054201625507, 'lr': 1.1743802485863199e-05}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:29:30,586] Trial 9 finished with value: 0.0008990837959572673 and parameters: {'lstm_units': 100, 'dropout_rate': 0.482405133666325, 'lr': 6.92196342074409e-05}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:29:53,351] Trial 10 finished with value: 0.0006328715244308114 and parameters: {'lstm_units': 50, 'dropout_rate': 0.3017412034512606, 'lr': 0.001807707131274098}. Best is trial 5 with value: 0.0005035395734012127.
[I 2024-05-01 22:30:39,398] Trial 11 finished with value: 0.000498659152071923 and parameters: {'lstm_units': 100, 'dropout_rate': 0.3452337125154388, 'lr': 0.0006220505444542745}. Best is trial 11 with value: 0.000498659152071923.
[I 2024-05-01 22:31:24,501] Trial 12 finished with value: 0.0009540743776597083 and parameters: {'lstm_units': 100, 'dropout_rate': 0.32659224940013326, 'lr': 0.0008798325039893869}. Best is trial 11 with value: 0.000498659152071923.
Best trial:
Value: 0.000498659152071923
Params: 
lstm_units: 100
dropout_rate: 0.3452337125154388
lr: 0.0006220505444542745

Optuna is a hyperparameter optimization library to automate hyperparameter search. TensorFlow and its Keras API are used to build and train the LSTM model. Adam optimizer is commonly used due to its efficiency in handling sparse gradients on noisy problems. The build_model function constructs a neural network based on the hyperparameters suggested by Optuna during the trials. It clears any existing TensorFlow backend session to ensure that the model starts with a clean state on each trial. The LSTM layers are dynamically configured to have units based on trial suggestions, which can significantly impact learning capacity and performance. Dropout is used to randomly set a fraction of input units to 0 at each update during training time, which helps to prevent overfitting.

Defining the Objective Function

The objective function orchestrates the training and evaluation process. It takes a trial object which provides suggestions for the hyperparameters and returns the loss of the model on the test set, which Optuna tries to minimize.

Running the Optimization

A study object is created with the goal to 'minimize' the loss. Optuna supports both minimization and maximization. The optimize method of the study object runs the optimization for a defined number of trials (n_trials) or until a certain time limit is reached (timeout).

Displaying the Best Trial

After the optimization, the best trial is accessed via study.best_trial. It displays the best performance observed during the optimization and the hyperparameters that led to that performance.

Conclusion

Using Optuna for hyperparameter optimization can lead to significant improvements in model performance by systematically searching through multiple combinations of hyperparameter settings. This process is crucial for fine-tuning deep learning models and can be particularly beneficial in complex domains such as financial time series forecasting.

In [26]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=64, verbose=1)
Epoch 1/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 4s 19ms/step - loss: 0.0348 - val_loss: 0.0016
Epoch 2/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.0017 - val_loss: 6.6941e-04
Epoch 3/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0016 - val_loss: 0.0011
Epoch 4/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.0012 - val_loss: 9.7776e-04
Epoch 5/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.0013 - val_loss: 8.9295e-04
Epoch 6/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0013 - val_loss: 6.4354e-04
Epoch 7/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 0.0013 - val_loss: 5.5434e-04
Epoch 8/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 0.0012 - val_loss: 9.5105e-04
Epoch 9/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 2s 44ms/step - loss: 0.0011 - val_loss: 0.0021
Epoch 10/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 3s 51ms/step - loss: 0.0013 - val_loss: 5.7144e-04
Epoch 11/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.0011 - val_loss: 0.0011
Epoch 12/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 9.1258e-04 - val_loss: 6.4081e-04
Epoch 13/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.0010 - val_loss: 6.6901e-04
Epoch 14/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 0.0011 - val_loss: 6.2649e-04
Epoch 15/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 9.9424e-04 - val_loss: 0.0016
Epoch 16/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 0.0011 - val_loss: 4.5030e-04
Epoch 17/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.0011 - val_loss: 0.0017
Epoch 18/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 9.3172e-04 - val_loss: 9.9930e-04
Epoch 19/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 9.7901e-04 - val_loss: 4.6408e-04
Epoch 20/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 9.0177e-04 - val_loss: 7.9490e-04
Epoch 21/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 9.5421e-04 - val_loss: 8.0474e-04
Epoch 22/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 0.0011 - val_loss: 5.7729e-04
Epoch 23/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.0903e-04 - val_loss: 0.0017
Epoch 24/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 8.8002e-04 - val_loss: 0.0017
Epoch 25/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 8.5928e-04 - val_loss: 8.5455e-04
Epoch 26/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.9640e-04 - val_loss: 0.0013
Epoch 27/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 7.4214e-04 - val_loss: 3.7975e-04
Epoch 28/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 8.6599e-04 - val_loss: 6.0650e-04
Epoch 29/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 8.6641e-04 - val_loss: 0.0013
Epoch 30/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 8.0409e-04 - val_loss: 0.0011
Epoch 31/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 7.2854e-04 - val_loss: 3.1957e-04
Epoch 32/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 2s 49ms/step - loss: 7.3699e-04 - val_loss: 7.3082e-04
Epoch 33/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 6.1679e-04 - val_loss: 3.6806e-04
Epoch 34/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 7.9721e-04 - val_loss: 3.0742e-04
Epoch 35/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 39ms/step - loss: 7.5590e-04 - val_loss: 2.9894e-04
Epoch 36/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 6.8833e-04 - val_loss: 7.4489e-04
Epoch 37/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step - loss: 7.0217e-04 - val_loss: 0.0011
Epoch 38/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 7.3997e-04 - val_loss: 0.0011
Epoch 39/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 7.8344e-04 - val_loss: 0.0013
Epoch 40/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 11ms/step - loss: 8.2099e-04 - val_loss: 3.4394e-04
Epoch 41/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 8.1899e-04 - val_loss: 2.7239e-04
Epoch 42/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 7.4848e-04 - val_loss: 2.9717e-04
Epoch 43/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.0904e-04 - val_loss: 3.7375e-04
Epoch 44/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.2711e-04 - val_loss: 3.4271e-04
Epoch 45/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.9872e-04 - val_loss: 5.4607e-04
Epoch 46/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 7.8388e-04 - val_loss: 3.1085e-04
Epoch 47/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step - loss: 8.0666e-04 - val_loss: 0.0010
Epoch 48/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 5.6177e-04 - val_loss: 2.8615e-04
Epoch 49/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 7.2094e-04 - val_loss: 5.3703e-04
Epoch 50/50
37/37 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 5.6650e-04 - val_loss: 5.6847e-04
Out[26]:
<keras.src.callbacks.history.History at 0x23599032e90>
In [27]:
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import numpy as np

def train_and_evaluate_model(X_train, y_train, X_test, y_test, lstm_units, num_layers, epochs=20, batch_size=64):
    model = Sequential()
    model.add(LSTM(lstm_units, return_sequences=(num_layers > 1), input_shape=(X_train.shape[1], X_train.shape[2])))
    for i in range(num_layers - 1):
        model.add(LSTM(lstm_units, return_sequences=(i < num_layers - 2)))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')

    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test), verbose=0, shuffle=False)

    train_predict = model.predict(X_train)
    test_predict = model.predict(X_test)

    train_rmse = np.sqrt(mean_squared_error(y_train, train_predict))
    test_rmse = np.sqrt(mean_squared_error(y_test, test_predict))

    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'], label='test')
    plt.title(f'Training and Validation loss (Units: {lstm_units}, Layers: {num_layers})')
   
In [28]:
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import numpy as np

def train_and_evaluate_model(X_train, y_train, X_test, y_test, lstm_units, dropout_rate, learning_rate, epochs=50, batch_size=64):
    model = Sequential()
    model.add(LSTM(lstm_units, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(lstm_units))  
    model.add(Dropout(dropout_rate))
    model.add(Dense(1))
    optimizer = Adam(learning_rate=learning_rate)
    model.compile(loss='mean_squared_error', optimizer=optimizer)

    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test), verbose=1, shuffle=False)

    train_predict = model.predict(X_train)
    test_predict = model.predict(X_test)

    train_rmse = np.sqrt(mean_squared_error(y_train, train_predict))
    test_rmse = np.sqrt(mean_squared_error(y_test, test_predict))

    plt.plot(history.history['loss'], label='Train Loss')
    plt.plot(history.history['val_loss'], label='Test Loss')
    plt.title('Training and Validation Loss')
    plt.legend()
    plt.show()

    return train_rmse, test_rmse

lstm_units = 100  # From Optuna
dropout_rate = 0.3452337125154388  # From Optuna
learning_rate = 0.0006220505444542745  # From Optuna

train_rmse, test_rmse = train_and_evaluate_model(
    X_train, y_train, X_test, y_test, 
    lstm_units=lstm_units, 
    dropout_rate=dropout_rate, 
    learning_rate=learning_rate
)
print(f"Train RMSE: {train_rmse}, Test RMSE: {test_rmse}")
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step 
Units: 50, Layers: 1 => Train RMSE: 0.03512533177352272, Test RMSE: 0.020643031084390882
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
Units: 50, Layers: 2 => Train RMSE: 0.0637469685834429, Test RMSE: 0.028491347000281594
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Units: 50, Layers: 3 => Train RMSE: 0.046748635757175284, Test RMSE: 0.026319640058699755
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step 
Units: 100, Layers: 1 => Train RMSE: 0.019572349799368467, Test RMSE: 0.021077017916277595
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 10ms/step
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Units: 100, Layers: 2 => Train RMSE: 0.07661178268558169, Test RMSE: 0.023474108356920143
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Units: 100, Layers: 3 => Train RMSE: 0.04039341945663431, Test RMSE: 0.034419499330793774

Explanation: Model Architecture: This setup uses a two-layer LSTM model with dropout after each LSTM layer. The return_sequences=True parameter in the first LSTM layer allows for sequence generation necessary for the following LSTM layer input. Optimizer Configuration: Uses the Adam optimizer with a learning rate specified by Optuna's results. Training and Evaluation: The model is trained using the given training set and evaluated on the test set, with a verbose output to track progress visually during training. Visualization: Plots the training and test loss over epochs to visualize the model's learning progress.

Correlation Analysis¶

Next, let's conduct a correlation analysis to identify potential linear relationships between different features within the Ethereum dataset. This can help in understanding the dependencies between different variables.

Importing Additional Libraries¶

We'll use seaborn for the heatmap and numpy for numerical operations.

In [29]:
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)

train_predict = scaler.inverse_transform(train_predict)
test_predict = scaler.inverse_transform(test_predict)
actual_y_train = scaler.inverse_transform(y_train)
actual_y_test = scaler.inverse_transform(y_test)

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(actual_y_test, label='Actual Price')
plt.plot(test_predict, label='Predicted Price', alpha=0.7)
plt.title('Ethereum Price Prediction')
plt.xlabel('Time')
plt.ylabel('Ethereum Price')
plt.legend()
plt.show()
74/74 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step
19/19 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step

Correlation Analysis¶

Next, let's conduct a correlation analysis to identify potential linear relationships between different features within the Ethereum dataset. This can help in understanding the dependencies between different variables.

Importing Additional Libraries¶

We'll use seaborn for the heatmap and numpy for numerical operations.

In [30]:
import seaborn as sns
import matplotlib.pyplot as plt

corr = ethereum_data.corr()

mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(15, 10))

sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap='coolwarm', cbar_kws={"shrink": .5})

plt.title('Correlation Matrix Heatmap')
plt.show()
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\1223983659.py:4: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  corr = ethereum_data.corr()
In [31]:
import plotly.graph_objects as go

corr = ethereum_data.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

corr_masked = corr.where(~mask, None)

fig = go.Figure(data=go.Heatmap(
    z=corr_masked,
    x=corr.columns,
    y=corr.columns,
    colorscale='RdBu',
    zmin=-1,  
    zmax=1
))

fig.update_layout(
    title='Correlation Matrix Heatmap',
    width=800,
    height=800,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    yaxis_autorange='reversed'
)

fig.show()
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\3844861659.py:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  corr = ethereum_data.corr()

Handling Missing Values and Outliers in Ethereum Dataset¶

Identifying and Filling Missing Values¶

Initially, we identify any missing values across each column of the dataset. Then, we fill these missing values with the median of their respective columns, which is a common practice to avoid bias caused by outliers.

In [32]:
print("NaN counts in each column:\n", ethereum_data.isnull().sum())
ethereum_data.fillna(ethereum_data.median(), inplace=True)
print("NaN counts after handling:\n", ethereum_data.isnull().sum())
NaN counts in each column:
 date                          0
Price                         0
Volume                        0
Market Cap                    0
year                          0
month                         0
day                           0
weekday                       0
hour                          0
price_volume_interaction      0
marketcap_volume_ratio        0
price_change                  1
volume_change                 1
price_7day_avg                6
volume_7day_avg               6
price_change_pct              1
volume_change_pct             1
market_cap_volume_ratio       0
price_lag1                    1
volume_lag1                   1
price_ema_short               0
price_ema_long                0
rsi                          14
week_of_year                  0
quarter                       0
days_since_launch             0
cumulative_return             1
cumulative_volume             0
price_ma_ratio               19
normalized_price              0
price_x_volume                0
marketcap_per_volume          0
price_squared                 0
price_change_pct_x_volume     1
ma_price_x_ma_volume          6
rsi_x_price_change_pct       14
return_volume_ratio           1
rsi_squared                  14
rsi_cubed                    14
rsi_squared_x_price          14
is_Q4                         0
is_start_of_year              0
Q4_volume_change              1
high_volume_price_change      0
dtype: int64
NaN counts after handling:
 date                         0
Price                        0
Volume                       0
Market Cap                   0
year                         0
month                        0
day                          0
weekday                      0
hour                         0
price_volume_interaction     0
marketcap_volume_ratio       0
price_change                 0
volume_change                0
price_7day_avg               0
volume_7day_avg              0
price_change_pct             0
volume_change_pct            0
market_cap_volume_ratio      0
price_lag1                   0
volume_lag1                  0
price_ema_short              0
price_ema_long               0
rsi                          0
week_of_year                 0
quarter                      0
days_since_launch            0
cumulative_return            0
cumulative_volume            0
price_ma_ratio               0
normalized_price             0
price_x_volume               0
marketcap_per_volume         0
price_squared                0
price_change_pct_x_volume    0
ma_price_x_ma_volume         0
rsi_x_price_change_pct       0
return_volume_ratio          0
rsi_squared                  0
rsi_cubed                    0
rsi_squared_x_price          0
is_Q4                        0
is_start_of_year             0
Q4_volume_change             0
high_volume_price_change     0
dtype: int64
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\2861788634.py:2: FutureWarning:

DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version.

C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\2861788634.py:2: DeprecationWarning:

In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`

Removing Columns with Zero Variance¶

Columns with zero variance (i.e., the same value across all entries) do not contribute to the model's predictive capability and are therefore removed from the dataset.

In [33]:
zero_var_cols = ethereum_data.columns[ethereum_data.nunique() <= 1]
print("Columns with zero variance:", zero_var_cols)
ethereum_data.drop(columns=zero_var_cols, inplace=True)
Columns with zero variance: Index(['hour'], dtype='object')

Handling Infinite Values and Recalculating Z-Scores¶

Next, we replace infinite values within our numeric columns with NaN, which allows for more consistent data filling. Following this, we refill these values using the median. We then attempt to calculate the Z-scores to identify outliers, defined as those beyond 3 standard deviations from the mean.

In [34]:
import numpy as np
from scipy import stats

numeric_cols = ethereum_data.select_dtypes(include=[np.number]).columns
ethereum_data[numeric_cols] = ethereum_data[numeric_cols].replace([np.inf, -np.inf], np.nan)

ethereum_data[numeric_cols] = ethereum_data[numeric_cols].fillna(ethereum_data[numeric_cols].median())

try:
    z_scores = np.abs(stats.zscore(ethereum_data[numeric_cols]))
    outliers_z = (z_scores > 3).any(axis=1)
    print("Detected outliers by Z-Score:", ethereum_data[outliers_z].shape[0])
except Exception as e:
    print("Error recalculating Z-scores:", str(e))

for column in numeric_cols:
    try:
        z_score = np.abs(stats.zscore(ethereum_data[column].dropna()))
        outlier = (z_score > 3)
        print(f"Outliers in {column}: {outlier.sum()}")
    except Exception as e:
        print(f"Error processing column {column}: {str(e)}")
Error recalculating Z-scores: loop of ufunc does not support argument 0 of type float which has no callable sqrt method
Outliers in Price: 37
Outliers in Volume: 48
Outliers in Market Cap: 36
Outliers in year: 0
Outliers in month: 0
Outliers in day: 0
Outliers in weekday: 0
Outliers in price_volume_interaction: 49
Outliers in marketcap_volume_ratio: 3
Outliers in price_change: 72
Outliers in volume_change: 42
Outliers in price_7day_avg: 43
Outliers in volume_7day_avg: 45
Outliers in price_change_pct: 50
Outliers in volume_change_pct: 3
Outliers in market_cap_volume_ratio: 3
Outliers in price_lag1: 37
Outliers in volume_lag1: 48
Outliers in price_ema_short: 41
Outliers in price_ema_long: 40
Outliers in rsi: 0
Error processing column week_of_year: loop of ufunc does not support argument 0 of type float which has no callable sqrt method
Outliers in quarter: 0
Outliers in days_since_launch: 0
Outliers in cumulative_return: 37
Outliers in cumulative_volume: 0
Outliers in price_ma_ratio: 49
Outliers in normalized_price: 37
Outliers in price_x_volume: 49
Outliers in marketcap_per_volume: 3
Outliers in price_squared: 94
Outliers in price_change_pct_x_volume: 41
Outliers in ma_price_x_ma_volume: 40
Outliers in rsi_x_price_change_pct: 67
Outliers in return_volume_ratio: 18
Outliers in rsi_squared: 13
Outliers in rsi_cubed: 38
Outliers in rsi_squared_x_price: 94
Outliers in is_Q4: 0
Outliers in is_start_of_year: 248
Outliers in Q4_volume_change: 61
Outliers in high_volume_price_change: 75

Detailed Outlier Detection per Column¶

Finally, we analyze outliers in each numeric column individually to understand the distribution and identify extreme values that could affect the performance of predictive models.

In [35]:
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

numeric_cols = ethereum_data.select_dtypes(include=[np.number])
numeric_cols.replace([np.inf, -np.inf], np.nan, inplace=True)
numeric_cols.dropna(inplace=True)

std_devs = numeric_cols.std()
print("Standard deviations of numeric columns:", std_devs)

valid_cols = std_devs[std_devs > 0].index
valid_numeric_data = numeric_cols[valid_cols].astype(float)

try:
    z_scores = np.abs(stats.zscore(valid_numeric_data, nan_policy='omit'))
    outliers_z = (z_scores > 3).any(axis=1)
    print("Detected outliers by Z-Score without error:", outliers_z.sum())
except Exception as e:
    print("Error recalculating Z-scores:", str(e))
Standard deviations of numeric columns: Price                        1.091812e+03
Volume                       1.227600e+10
Market Cap                   1.304226e+11
year                         2.351049e+00
month                        3.462797e+00
day                          8.810793e+00
weekday                      1.999072e+00
price_volume_interaction     3.323323e+13
marketcap_volume_ratio       1.436267e+02
price_change                 6.456290e+01
volume_change                5.630875e+09
price_7day_avg               1.089073e+03
volume_7day_avg              1.152035e+10
price_change_pct             5.477677e-02
volume_change_pct            3.052744e+02
market_cap_volume_ratio      1.436267e+02
price_lag1                   1.091659e+03
volume_lag1                  1.227466e+10
price_ema_short              1.086028e+03
price_ema_long               1.079339e+03
rsi                          1.824692e+01
week_of_year                 1.514276e+01
quarter                      1.123532e+00
days_since_launch            8.560315e+02
cumulative_return            2.482507e+03
cumulative_volume            1.020081e+13
price_ma_ratio               1.544854e-01
normalized_price             2.267520e-01
price_x_volume               3.323323e+13
marketcap_per_volume         1.436267e+02
price_squared                3.854817e+06
price_change_pct_x_volume    1.326119e+09
ma_price_x_ma_volume         3.221227e+13
rsi_x_price_change_pct       3.221176e+00
return_volume_ratio          2.675421e-08
rsi_squared                  1.986515e+03
rsi_cubed                    1.819791e+05
rsi_squared_x_price          4.542350e+06
is_Q4                        4.402400e-01
is_start_of_year             2.769401e-01
Q4_volume_change             1.868019e+09
high_volume_price_change     6.274193e+01
dtype: float64
Detected outliers by Z-Score without error: 613
In [36]:
print(ethereum_data.select_dtypes(include=[np.number]).dtypes)
Price                        float64
Volume                       float64
Market Cap                   float64
year                           int64
month                          int64
day                            int64
weekday                        int64
price_volume_interaction     float64
marketcap_volume_ratio       float64
price_change                 float64
volume_change                float64
price_7day_avg               float64
volume_7day_avg              float64
price_change_pct             float64
volume_change_pct            float64
market_cap_volume_ratio      float64
price_lag1                   float64
volume_lag1                  float64
price_ema_short              float64
price_ema_long               float64
rsi                          float64
week_of_year                  UInt32
quarter                        int64
days_since_launch              int64
cumulative_return            float64
cumulative_volume            float64
price_ma_ratio               float64
normalized_price             float64
price_x_volume               float64
marketcap_per_volume         float64
price_squared                float64
price_change_pct_x_volume    float64
ma_price_x_ma_volume         float64
rsi_x_price_change_pct       float64
return_volume_ratio          float64
rsi_squared                  float64
rsi_cubed                    float64
rsi_squared_x_price          float64
is_Q4                          int64
is_start_of_year               int64
Q4_volume_change             float64
high_volume_price_change     float64
dtype: object
In [37]:
ethereum_data.replace([np.inf, -np.inf], np.nan, inplace=True)
ethereum_data.fillna(ethereum_data.median(), inplace=True) 
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\3595528556.py:2: FutureWarning:

DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version.

C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\3595528556.py:2: DeprecationWarning:

In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`

Outlier Detection and Removal We utilize Z-scores to identify outliers. Here, outliers are defined as observations that are more than 3 standard deviations away from the mean.

In [38]:
import numpy as np
from scipy import stats
from scipy.stats.mstats import winsorize

numeric_cols = ethereum_data.select_dtypes(include=[np.number])
numeric_cols.replace([np.inf, -np.inf], np.nan, inplace=True)
numeric_cols.dropna(inplace=True)

std_devs = numeric_cols.std()
constant_cols = std_devs[std_devs == 0].index
valid_cols = std_devs[std_devs > 0].index

if not constant_cols.empty:
    ethereum_data.drop(columns=constant_cols, inplace=True)
    print("Dropped constant columns:", constant_cols)

valid_numeric_data = numeric_cols[valid_cols].astype(float)
try:
    z_scores = np.abs(stats.zscore(valid_numeric_data, nan_policy='omit'))
    outliers_z = (z_scores > 3).any(axis=1)
    print("Detected outliers by Z-Score:", outliers_z.sum())
except Exception as e:
    print("Error recalculating Z-scores:", str(e))
Detected outliers by Z-Score: 613

Winsorizing Data

To further mitigate the effect of extreme values, we apply winsorization to cap extreme values at both ends of the distribution.

In [39]:
filtered_data = valid_numeric_data[~outliers_z]
print(f"Data after removing outliers has {filtered_data.shape[0]} records out of {valid_numeric_data.shape[0]} original records.")
from scipy.stats.mstats import winsorize
winsorized_data = valid_numeric_data.apply(lambda x: winsorize(x, limits=[0.05, 0.05]))
transformed_data = valid_numeric_data.copy()
transformed_data = transformed_data.apply(lambda x: np.log(x + 1) if np.all(x > 0) else x)
Data after removing outliers has 2351 records out of 2964 original records.
In [40]:
print(filtered_data.describe())
             Price        Volume    Market Cap         year        month  \
count  2351.000000  2.351000e+03  2.351000e+03  2351.000000  2351.000000   
mean    747.455660  7.704654e+09  8.652092e+10  2019.236070     7.049341   
std     848.214629  8.461588e+09  1.019509e+11     2.347836     3.108005   
min       0.439769  1.692088e+05  3.259030e+07  2015.000000     2.000000   
25%     145.160874  6.050580e+08  1.555304e+10  2017.000000     4.000000   
50%     300.570344  6.067278e+09  2.941204e+10  2019.000000     7.000000   
75%    1520.467072  1.157127e+10  1.814142e+11  2021.000000    10.000000   
max    3644.405517  4.495794e+10  4.320824e+11  2023.000000    12.000000   

               day      weekday  price_volume_interaction  \
count  2351.000000  2351.000000              2.351000e+03   
mean     15.816674     2.994470              1.007741e+13   
std       8.765891     2.005198              1.804724e+13   
min       1.000000     0.000000              1.473332e+05   
25%       8.000000     1.000000              1.690774e+11   
50%      16.000000     3.000000              1.404763e+12   
75%      23.000000     5.000000              1.150871e+13   
max      31.000000     6.000000              1.089137e+14   

       marketcap_volume_ratio  price_change  ...  ma_price_x_ma_volume  \
count             2351.000000   2351.000000  ...          2.351000e+03   
mean                38.811834      0.546849  ...          1.050983e+13   
std                 54.357068     36.972100  ...          1.859886e+13   
min                  0.611666   -186.887860  ...          2.370337e+05   
25%                  4.780971     -5.804794  ...          1.780359e+11   
50%                 19.145416     -0.006202  ...          1.476849e+12   
75%                 49.499925      7.184352  ...          1.226244e+13   
max                419.478183    180.953324  ...          1.103269e+14   

       rsi_x_price_change_pct  return_volume_ratio  rsi_squared  \
count             2351.000000         2.351000e+03  2351.000000   
mean                 0.239944         4.277942e-09  2855.954230   
std                  2.213093         9.202788e-09  1765.815076   
min                 -8.585443         5.143438e-11    19.410972   
25%                 -0.887301         1.457301e-10  1419.928232   
50%                 -0.007796         3.175254e-10  2534.723946   
75%                  1.081301         5.009221e-09  4034.377010   
max                 10.074481         8.490811e-08  8230.783038   

           rsi_cubed  rsi_squared_x_price        is_Q4  is_start_of_year  \
count    2351.000000         2.351000e+03  2351.000000            2351.0   
mean   174155.280081         2.062944e+06     0.266695               0.0   
std    153430.319760         2.813506e+06     0.442326               0.0   
min        85.520637         2.214744e+02     0.000000               0.0   
25%     53505.746367         1.879255e+05     0.000000               0.0   
50%    127613.318242         8.361186e+05     0.000000               0.0   
75%    256250.520678         2.742979e+06     1.000000               0.0   
max    746726.786952         1.556261e+07     1.000000               0.0   

       Q4_volume_change  high_volume_price_change  
count      2.351000e+03               2351.000000  
mean       1.086705e+06                  0.825408  
std        8.633646e+08                 35.290293  
min       -5.536621e+09               -186.887860  
25%        0.000000e+00                  0.000000  
50%        0.000000e+00                  0.000000  
75%       -0.000000e+00                  0.174759  
max        5.550065e+09                180.953324  

[8 rows x 42 columns]

Visualizing Ethereum Data¶

This section demonstrates various ways to visualize Ethereum data using Python libraries such as Seaborn, Plotly, and Dash. Each block of code is designed to provide insights into different aspects of Ethereum's price and other attributes over time.

Line Plot: Price over Time with Seaborn¶

Here, we create a simple line plot using Seaborn to visualize how Ethereum's price has changed over time. This can help in identifying trends or significant changes in the market.

Line Plot: Price over Time

In [41]:
import seaborn as sns
import matplotlib.pyplot as plt
ed['date'] = pd.to_datetime(ed['date'])
ed = ed.sort_values(by='date')
sns.set(style="darkgrid")
plt.figure(figsize=(10, 6))
sns.lineplot(x='date', y='Price', data=ed)
plt.title('Price over Time')
plt.xticks(rotation=45)
plt.show()

Interactive Line Plot: Price over Time with Plotly Using Plotly, we can make the visualization interactive, which is particularly useful for web-based dashboards.

In [42]:
import plotly.express as px
import pandas as pd

ed['date'] = pd.to_datetime(ed['date'])
ed = ed.sort_values(by='date')

fig = px.line(ed, x='date', y='Price', title='Price Trend Over Time',
              labels={'date': 'Date', 'Price': 'Price in USD'},
              line_shape='linear',
              render_mode='svg') 

fig.update_layout(
    title_font_size=20,
    title_x=0.5,
    xaxis_title_font_size=14,
    yaxis_title_font_size=14,
    xaxis_tickangle=-45 
)

fig.show()

Interactive Dashboard with Dash

This example sets up a basic Dash application that allows users to select different features of Ethereum data to display on a line chart.

In [45]:
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd



app = dash.Dash(__name__)

app.layout = html.Div([
    dcc.Graph(id='time-series-chart'),
    html.Label('Select Feature:'),
    dcc.Dropdown(
        id='feature-selector',
        options=[{'label': i, 'value': i} for i in ethereum_data.columns if i not in ['Timestamp', 'year', 'month', 'day', 'weekday', 'hour']],
        value='Price'
    )
])

@app.callback(
    Output('time-series-chart', 'figure'),
    [Input('feature-selector', 'value')]
)
def update_graph(selected_feature):
    fig = px.line(ethereum_data, x='date', y=selected_feature, title=f'{selected_feature} Over Time')
    return fig

if __name__ == '__main__':
    app.run_server(debug=True, port=8052)

3D Scatter Plot: Price, Volume, and Market Cap

This Plotly visualization creates a 3D scatter plot to examine the relationships between price, volume, and market cap across different years.

In [46]:
import plotly.express as px

fig = px.scatter_3d(
    ethereum_data, 
    x='Price', 
    y='Volume', 
    z='Market Cap', 
    color='year'  
)

fig.update_layout(
    title='3D Scatter Plot of Ethereum Price, Volume, and Market Cap',
    scene=dict(
        xaxis_title='Price USD',
        yaxis_title='Volume',
        zaxis_title='Market Cap'
    )
)

fig.show()

Bar Plot: Average Price per Month

In [47]:
plt.figure(figsize=(10, 6))
sns.barplot(x='month', y='Price', data=ed)
plt.title('Average Price per Month')
plt.show()

Histogram: Price Distribution

In [48]:
plt.figure(figsize=(10, 6))
sns.histplot(ed['Price'], bins=30)
plt.title('Price Distribution')
plt.show()

A seaborn histogram provides insights into the distribution of Ethereum prices, such as the range of prices and any skewness in the data.

Scatter Plot: Price vs. Volume

In [49]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Price', y='Volume', data=ed)
plt.title('Price vs. Volume')
plt.show()

This seaborn scatter plot explores the relationship between Ethereum's price and trading volume, potentially revealing correlation patterns.

Box Plot: Price Distribution by Quarter

In [50]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='quarter', y='Price', data=ed)
plt.title('Price Distribution by Quarter')
plt.show()

The seaborn box plot divides the Ethereum price data by quarters, showing how prices vary throughout the year and identifying any outliers.

Correlation Heatmap

In [51]:
plt.figure(figsize=(10, 10))
sns.heatmap(ed[['Price', 'Volume', 'Market Cap', 'price_7day_avg', 'volume_7day_avg']].corr(), annot=True, fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

A seaborn heatmap is used to visualize the correlation between different features like 'Price', 'Volume', 'Market Cap', etc. It helps to identify which features are most strongly related to each other.

Pair Plot

In [52]:
sns.pairplot(ed[['Price', 'Volume', 'Market Cap']])
plt.show()

A seaborn pair plot offers a comprehensive view of bivariate relationships between multiple features ('Price', 'Volume', 'Market Cap'), including scatter plots and histograms.

Violin Plot: Price Distribution by Weekday

This seaborn violin plot shows the distribution of Ethereum prices across different weekdays, helping to identify any weekly patterns.

In [53]:
plt.figure(figsize=(10, 6))
sns.violinplot(x='weekday', y='Price', data=ed)
plt.title('Price Distribution by Weekday')
plt.show()

This seaborn violin plot shows the distribution of Ethereum prices across different weekdays, helping to identify any weekly patterns.

Facet Grid: Price Trend by Quarter

In [54]:
g = sns.FacetGrid(ed, col='quarter', height=4, aspect=1)
g = g.map(plt.plot, 'date', 'Price')
plt.show()

A seaborn FacetGrid is used to create a series of line plots, each representing Ethereum price trends in different quarters, allowing for a comparative analysis across quarters.

Density Plot: Price Distribution

In [55]:
plt.figure(figsize=(10, 6))
sns.kdeplot(ed['Price'], shade=True)
plt.title('Density Plot for Price')
plt.show()
C:\Users\Luke Holmes\AppData\Local\Temp\ipykernel_2688\2343393537.py:2: FutureWarning:



`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.


A seaborn KDE plot visualizes the density distribution of Ethereum prices, providing a smoothed representation of the data.

Seasonal Decomposition

In [56]:
from statsmodels.tsa.seasonal import seasonal_decompose

ed['date'] = pd.to_datetime(ed['date'])
ed.set_index('date', inplace=True)

result = seasonal_decompose(ed['Price'], model='additive', period=365)
result.plot()
plt.show()

Using statsmodels, a seasonal decomposition of Ethereum prices is conducted to separate the time series into trend, seasonality, and residuals.

Swarm Plot: Price Distribution by Weekday

In [57]:
plt.figure(figsize=(10, 6))
sns.swarmplot(x='weekday', y='Price', data=ed)
plt.title('Price Distribution by Weekday')
plt.show()
C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

38.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

40.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

39.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

40.1% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

38.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

38.4% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

39.7% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

41.8% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

41.0% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

41.3% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

40.6% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

C:\Users\Luke Holmes\anaconda3\Lib\site-packages\seaborn\categorical.py:3544: UserWarning:

39.8% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

This seaborn swarm plot offers a detailed view of how Ethereum prices vary on different weekdays. It provides insights into weekly trends and price dispersion.

Pair Grid: Comprehensive Analysis of Price, Volume, and Market Cap

In [59]:
g = sns.PairGrid(ed[['Price', 'Volume', 'Market Cap']])
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot, colors='blue')
g.map_diag(sns.histplot, kde=True)
plt.show()

Candlestick Chart Using Plotly Lastly, we create a candlestick chart to visualize Ethereum price movements in a more detailed and visually appealing manner.

In [ ]:
ed['date'] = ed['date'].dt.strftime('%Y-%m-%d')

import plotly.graph_objects as go

fig = go.Figure(data=[go.Candlestick(x=ed['date'],
                                     open=ed['Price'],  
                                     high=ed['Price']*1.02,  # Simulated high price (2% higher)
                                     low=ed['Price']*0.98,  # Simulated low price (2% lower)
                                     close=ed['Price'])])

fig.update_layout(title='Ethereum Price Candlestick Chart', xaxis_title='Date', yaxis_title='Price (USD)')
fig.show()

Stacking Regressor for Ethereum Price Prediction

We employ a stacking approach, combining multiple regression models to improve the prediction accuracy.

In [60]:
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import numpy as np
import pandas as pd


ethereum_data = pd.DataFrame({
    'Volume': np.random.rand(100),
    'Market Cap': np.random.rand(100),
    'year': np.random.randint(2015, 2023, 100),
    'month': np.random.randint(1, 13, 100),
    'day': np.random.randint(1, 32, 100),
    'Price': np.random.rand(100) * 1000
})

X = ethereum_data[['Volume', 'Market Cap', 'year', 'month', 'day']]
y = ethereum_data['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Block 2: Configure and Train Stacking Regressor

In [61]:
estimators = [
    ('lr', LinearRegression()),
    ('dt', DecisionTreeRegressor(max_depth=5)),
    ('svr', SVR(kernel='linear', C=0.1))
]

stacking_regressor = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
stacking_regressor.fit(X_train, y_train)
print('Stacking Model Score:', stacking_regressor.score(X_test, y_test))
Stacking Model Score: -0.0553694531956892

Optimizing RandomForestRegressor with RandomizedSearchCV

This block configures a RandomizedSearchCV to find the best hyperparameters for a RandomForestRegressor, aiming to improve model performance.

In [62]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
import numpy as np

param_distributions = {
    'n_estimators': np.arange(100, 301, 100),
    'max_depth': [None, 10, 20],
    'min_samples_split': [5, 10],
    'min_samples_leaf': [2, 4],
    'max_features': [None, 'sqrt', 'log2']  # Changed 'auto' to None
}

rf = RandomForestRegressor(random_state=42)
rf_random = RandomizedSearchCV(
    estimator=rf, 
    param_distributions=param_distributions, 
    n_iter=30, 
    cv=2, 
    verbose=2, 
    random_state=42, 
    n_jobs=-1,
    error_score=np.nan  # Continue on error with nan score
)

X_train, y_train = np.random.rand(100, 5), np.random.rand(100)  # Example data, replace with your actual data
rf_random.fit(X_train, y_train)
Fitting 2 folds for each of 30 candidates, totalling 60 fits
Out[62]:
RandomizedSearchCV(cv=2, estimator=RandomForestRegressor(random_state=42),
                   n_iter=30, n_jobs=-1,
                   param_distributions={'max_depth': [None, 10, 20],
                                        'max_features': [None, 'sqrt', 'log2'],
                                        'min_samples_leaf': [2, 4],
                                        'min_samples_split': [5, 10],
                                        'n_estimators': array([100, 200, 300])},
                   random_state=42, verbose=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=2, estimator=RandomForestRegressor(random_state=42),
                   n_iter=30, n_jobs=-1,
                   param_distributions={'max_depth': [None, 10, 20],
                                        'max_features': [None, 'sqrt', 'log2'],
                                        'min_samples_leaf': [2, 4],
                                        'min_samples_split': [5, 10],
                                        'n_estimators': array([100, 200, 300])},
                   random_state=42, verbose=2)
RandomForestRegressor(random_state=42)
RandomForestRegressor(random_state=42)

Block 4: Output Feature Importance from Best Model

In [65]:
X = ethereum_data.drop('Price', axis=1) 
y = ethereum_data['Price']  
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [67]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2'], 
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
print('Best Parameters:', grid_search.best_params_)
Fitting 3 folds for each of 144 candidates, totalling 432 fits
Best Parameters: {'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 50}
In [68]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

ed.fillna(method='ffill', inplace=True)

X = ed[['Volume', 'Market Cap', 'year', 'month', 'day']]
y = ed['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Root Mean Squared Error: {rmse}')
Root Mean Squared Error: 39.47773162973353

ARIMA Model for Time Series Forecasting

To further our analysis, we apply an ARIMA model to forecast Ethereum prices based on historical data.

In [70]:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

ed['date'] = pd.to_datetime(ed[['year', 'month', 'day']])

ed.set_index('date', inplace=True)
ed.index = pd.DatetimeIndex(ed.index).to_period('D')

model_arima = ARIMA(ed['Price'], order=(5,1,0))
model_arima_fit = model_arima.fit()

print(model_arima_fit.summary())
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                  Price   No. Observations:                 2964
Model:                 ARIMA(5, 1, 0)   Log Likelihood              -16537.729
Date:                Thu, 02 May 2024   AIC                          33087.458
Time:                        10:26:32   BIC                          33123.422
Sample:                    10-21-2015   HQIC                         33100.403
                         - 12-02-2023                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.0654      0.009     -7.557      0.000      -0.082      -0.048
ar.L2          0.0273      0.008      3.597      0.000       0.012       0.042
ar.L3          0.0251      0.008      3.112      0.002       0.009       0.041
ar.L4          0.0286      0.008      3.508      0.000       0.013       0.045
ar.L5         -0.0604      0.007     -8.566      0.000      -0.074      -0.047
sigma2      4132.4564     34.452    119.947      0.000    4064.931    4199.982
===================================================================================
Ljung-Box (L1) (Q):                   0.10   Jarque-Bera (JB):             62028.38
Prob(Q):                              0.76   Prob(JB):                         0.00
Heteroskedasticity (H):              16.84   Skew:                            -0.97
Prob(H) (two-sided):                  0.00   Kurtosis:                        25.33
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

This section covers the use of Scikit-learn's train_test_split to divide the data into training and test sets, and LinearRegression for modeling Ethereum prices. The model is trained on selected features like 'Volume', 'Market Cap', and date components, and its performance is evaluated using the root mean squared error (RMSE).

Ridge Regression Implementation¶

In this section, we apply Ridge Regression to our dataset. Ridge Regression is a type of linear regression that includes a regularization term. This regularization term (L2 penalty) discourages learning overly complex models to prevent overfitting. We scale our features using StandardScaler to normalize the data, ensuring that the model isn't biased towards variables on a larger scale.

In [ ]:
 
In [73]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train_scaled, y_train)
y_pred_ridge = ridge_reg.predict(X_test_scaled)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))

print(f'Ridge Regression RMSE: {rmse_ridge}')
Ridge Regression RMSE: 39.52609293882694

Hyperparameter Tuning with Hyperopt¶

Here we use Hyperopt, a library for serial and parallel optimization over awkward search spaces, which may include real-valued, discrete, and conditional dimensions. We define an objective function to minimize, set up a space of hyperparameters, and use the Tree-structured Parzen Estimator (TPE) method to find the best hyperparameters for a RandomForestRegressor model.

In [74]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials

def objective(space):
    model = RandomForestRegressor(n_estimators=int(space['n_estimators']),
                                  max_depth=int(space['max_depth']),
                                  min_samples_split=int(space['min_samples_split']),
                                  min_samples_leaf=int(space['min_samples_leaf']))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    return {'loss': mse, 'status': STATUS_OK}

space = {
    'n_estimators': hp.quniform('n_estimators', 100, 1000, 100),
    'max_depth': hp.quniform('max_depth', 10, 50, 10),
    'min_samples_split': hp.choice('min_samples_split', [2, 5, 10]),
    'min_samples_leaf': hp.choice('min_samples_leaf', [1, 2, 4])
}

trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=100,
            trials=trials)

print("Best hyperparameters:", best)
100%|██████████| 100/100 [23:23<00:00, 14.03s/trial, best loss: 195.38881312438133]
Best hyperparameters: {'max_depth': 50.0, 'min_samples_leaf': 0, 'min_samples_split': 0, 'n_estimators': 900.0}

Sharpe Ratio Calculation¶

The Sharpe ratio is used to measure the performance of an investment compared to a risk-free asset, after adjusting for its risk. It is the average return earned in excess of the risk-free rate per unit of volatility or total risk. Calculating the Sharpe ratio is useful for understanding the return of an investment compared to its risk.

In [104]:
import numpy as np

def sharpe_ratio(returns):
    mean_returns = np.mean(returns)
    std_returns = np.std(returns)
    sharpe_ratio = mean_returns / std_returns * np.sqrt(252)  # Assuming daily returns
    return sharpe_ratio

ed['returns'] = ed['Price'].pct_change()
print("Sharpe Ratio:", sharpe_ratio(ed['returns'].dropna()))
Sharpe Ratio: 1.2621260038434017

Lasso

In [76]:
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)
y_pred_lasso = lasso_reg.predict(X_test)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
print(f'Lasso Regression RMSE: {rmse_lasso}')
Lasso Regression RMSE: 39.47820594449236

Implementing Decision Tree Regression

In [77]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)
y_pred_tree = tree_reg.predict(X_test)
rmse_tree = np.sqrt(mean_squared_error(y_test, y_pred_tree))
print(f'Decision Tree Regression RMSE: {rmse_tree}')
Decision Tree Regression RMSE: 14.733249675918618

In this section, we explore the Decision Tree Regression model, known for its ability to capture complex, non-linear relationships in data. After training the model on the Ethereum dataset, we evaluate its performance using the RMSE metric, providing insights into its effectiveness compared to simpler models.

Random Forest Regressor

In [78]:
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print(f'Random Forest Regression RMSE: {rmse_rf}')
Random Forest Regression RMSE: 14.261140429352936

This section focuses on Random Forest Regression, an advanced ensemble method that combines multiple decision trees to enhance predictive accuracy and robustness. After fitting the model to our Ethereum dataset, we assess its performance using the RMSE value, comparing it against previous models to gauge its relative effectiveness. Blow is a Residual Plot to visualise the distribution of errors:

In [79]:
import matplotlib.pyplot as plt
rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
residuals = y_test - y_pred_rf
plt.scatter(y_pred_rf, residuals)
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for Random Forest Regression')
plt.axhline(y=0, color='r', linestyle='--')
plt.show()

SVR

In [80]:
from sklearn.svm import SVR

svr_reg = SVR(kernel='rbf')
svr_reg.fit(X_train, y_train)
y_pred_svr = svr_reg.predict(X_test)
rmse_svr = np.sqrt(mean_squared_error(y_test, y_pred_svr))
print(f'Support Vector Regression RMSE: {rmse_svr}')
Support Vector Regression RMSE: 707.9913315066016

In this section, we implement Support Vector Regression (SVR), a versatile machine learning algorithm, on the Ethereum dataset. SVR is known for its effectiveness in handling non-linear relationships. We employ the Radial Basis Function (RBF) kernel and evaluate the model's performance using the RMSE metric, providing insights into its predictive accuracy.

Gradient Boosting Regression Implementation

In [81]:
from sklearn.ensemble import GradientBoostingRegressor

gb_reg = GradientBoostingRegressor(n_estimators=100)
gb_reg.fit(X_train, y_train)
y_pred_gb = gb_reg.predict(X_test)
rmse_gb = np.sqrt(mean_squared_error(y_test, y_pred_gb))
print(f'Gradient Boosting Regression RMSE: {rmse_gb}')
Gradient Boosting Regression RMSE: 15.29218574410884

This section focuses on the implementation of Gradient Boosting Regression, a powerful ensemble learning technique. Gradient Boosting Regression builds an additive model in a forward stage-wise fashion, allowing for the optimization of arbitrary differentiable loss functions. The model is trained on Ethereum dataset features with 100 estimators, and its effectiveness is evaluated using the Root Mean Squared Error (RMSE) metric.

Model Performance Comparison

In [82]:
print(f'Ridge Regression RMSE: {rmse_ridge}')
print(f'Lasso Regression RMSE: {rmse_lasso}')
print(f'Decision Tree Regression RMSE: {rmse_tree}')
print(f'Random Forest Regression RMSE: {rmse_rf}')
print(f'Support Vector Regression RMSE: {rmse_svr}')
print(f'Gradient Boosting Regression RMSE: {rmse_gb}')
Ridge Regression RMSE: 39.52609293882694
Lasso Regression RMSE: 39.47820594449236
Decision Tree Regression RMSE: 14.733249675918618
Random Forest Regression RMSE: 14.261140429352936
Support Vector Regression RMSE: 707.9913315066016
Gradient Boosting Regression RMSE: 15.29218574410884

In this section, the performance of all implemented models is compared using the Root Mean Squared Error (RMSE) metric. This comparison is crucial for determining the most effective model for predicting Ethereum prices. The RMSE values for Ridge Regression, Lasso Regression, Decision Tree Regression, Random Forest Regression, Support Vector Regression, and Gradient Boosting Regression are displayed, providing insights into each model's accuracy and predictive power.

Custom Accuracy Metric

In [83]:
def custom_accuracy(y_true, y_pred, threshold=0.01):
    """
    Calculate the percentage of predictions within a certain threshold.

    :param y_true: Actual values
    :param y_pred: Predicted values
    :param threshold: Threshold for considering a prediction accurate (default 1%)
    :return: Accuracy as a percentage
    """
    within_threshold = np.abs(y_true - y_pred) <= threshold * np.abs(y_true)
    accuracy = np.mean(within_threshold)
    return accuracy * 100

accuracy = custom_accuracy(y_test, y_pred_rf)
print(f'Custom Accuracy: {accuracy:.2f}%')
Custom Accuracy: 69.48%

This section introduces a custom accuracy metric designed to evaluate the model's predictions based on a specified threshold. The custom accuracy function, named custom_accuracy, calculates the percentage of predictions that fall within a certain margin (threshold) of the actual values. This metric is particularly useful for understanding the practical effectiveness of the model in scenarios where slight deviations from the actual values are acceptable.

Cross-Validation with Random Forest Regressor¶

We begin by performing cross-validation on a Random Forest Regressor to evaluate its performance more robustly. This is crucial to ensure that our model is not only fitting to a particular subset of the data. Here, we use the cross_val_score function with 5 folds, which will provide us with insights into the model's stability and performance across different subsets of the data. We use the negative mean squared error as the scoring method and calculate the RMSE (Root Mean Squared Error) for each fold to get a sense of the average error magnitude.

In [84]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=100)
cv_scores = cross_val_score(rf_reg, X, y, cv=5, scoring='neg_mean_squared_error')
rmse_scores = np.sqrt(-cv_scores)
print("RMSE scores for each fold:", rmse_scores)
print(f"Mean RMSE: {np.mean(rmse_scores)}")
print(f"Standard Deviation of RMSE: {np.std(rmse_scores)}")
RMSE scores for each fold: [ 71.66345284  71.33436172  17.16693587 481.22333446  87.74440581]
Mean RMSE: 145.82649813889662
Standard Deviation of RMSE: 169.39131376842155

Time Series Cross-Validation¶

Next, we employ time series cross-validation to evaluate the Random Forest model. This method is particularly useful when dealing with time series data, as it respects the temporal order of observations. We use the TimeSeriesSplit from sklearn, specifying 5 splits, and print out the score for each fold to observe how the model performs over time.

In [105]:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
model = RandomForestRegressor(**grid_search.best_params_)

for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    model.fit(X_train, y_train)
    print('Fold Score:', model.score(X_test, y_test))
Fold Score: -1.9445148450106546
Fold Score: -4.397333022336436
Fold Score: 0.6594040240142576
Fold Score: -2.519497914536573
Fold Score: -2.1244451417978993

Randomized Search for Hyperparameter Tuning¶

We enhance our model tuning by conducting a Randomized Search for the best hyperparameters. Randomized Search offers a probabilistic approach that searches the parameter space more efficiently than GridSearchCV. Here, we define a range of values for hyperparameters and use RandomizedSearchCV to find the best combination based on the specified distribution.

In [86]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_random = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [10, 20, 30, 40, 50, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

rf_random = RandomForestRegressor()

random_search = RandomizedSearchCV(estimator=rf_random, param_distributions=param_random, 
                                   n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)
print("Randomized Search Best Parameters:", random_search.best_params_)

max_depth = random_search.best_params_['max_depth']
if max_depth is None:
    max_depth_values = [None]  
else:
    max_depth_values = [max_depth - 10 if max_depth > 10 else 5, max_depth, max_depth + 10]

param_grid = {
    'n_estimators': [random_search.best_params_['n_estimators']],
    'max_features': [random_search.best_params_['max_features']],
    'max_depth': max_depth_values,
    'min_samples_split': [random_search.best_params_['min_samples_split']],
    'min_samples_leaf': [random_search.best_params_['min_samples_leaf']]
}

grid_search = GridSearchCV(estimator=RandomForestRegressor(), param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
print('Grid Search Best Parameters:', grid_search.best_params_)
Fitting 3 folds for each of 100 candidates, totalling 300 fits
Randomized Search Best Parameters: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 10, 'bootstrap': False}
Fitting 3 folds for each of 3 candidates, totalling 9 fits
Grid Search Best Parameters: {'max_depth': 20, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}

Randomized Search for Hyperparameter Tuning¶

We enhance our model tuning by conducting a Randomized Search for the best hyperparameters. Randomized Search offers a probabilistic approach that searches the parameter space more efficiently than GridSearchCV. Here, we define a range of values for hyperparameters and use RandomizedSearchCV to find the best combination based on the specified distribution.

In [87]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_features': ['sqrt', 'log2'],  # Update 'max_features' here
    'max_depth': [10, 20, 30, 40, 50, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

rf = RandomForestRegressor()

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, 
                               n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1,
                               error_score='raise')


rf_random.fit(X_train, y_train)

print("Best Parameters:", rf_random.best_params_)
Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best Parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': 50, 'bootstrap': False}

Feature Importance from the Best Model¶

After finding the best hyperparameters, we train a Random Forest Regressor using these parameters and examine the feature importances. This allows us to see which features are most influential in predicting the Ethereum price.

In [106]:
from sklearn.ensemble import RandomForestRegressor
best_params = {
    'n_estimators': rf_random.best_params_['n_estimators'],
    'max_features': rf_random.best_params_['max_features'],
    'max_depth': rf_random.best_params_['max_depth'],
    'min_samples_split': rf_random.best_params_['min_samples_split'],
    'min_samples_leaf': rf_random.best_params_['min_samples_leaf'],
    'bootstrap': rf_random.best_params_['bootstrap']
}
final_rf_reg = RandomForestRegressor(**best_params)
final_rf_reg.fit(X_train, y_train)
y_pred_rf_final = final_rf_reg.predict(X_test)
rmse_rf_final = np.sqrt(mean_squared_error(y_test, y_pred_rf_final))
print(f'Random Forest Regression RMSE (with best hyperparameters): {rmse_rf_final}')
Random Forest Regression RMSE (with best hyperparameters): 289.4496998373119

Time Series Cross-Validation¶

In this section, we utilize TimeSeriesSplit from sklearn to perform time series cross-validation. This is particularly suitable for time series data to validate the model in a way that respects the temporal order of observations. We use the best estimator from a previous RandomizedSearchCV, and calculate the negative mean squared error across each fold. We then compute the root mean squared error (RMSE) for each split to assess the model's performance over time.

In [89]:
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(rf_random.best_estimator_, X_train, y_train, cv=tscv, scoring='neg_mean_squared_error')
print("Time-series CV scores:", np.sqrt(-scores))
Time-series CV scores: [376.93872734 193.41857629  16.38233049 895.52961301 994.12721737]

Model Backtesting¶

After cross-validation, we perform a backtest by splitting the data at a certain point in time (70% of the data for training and the rest for testing). This method is commonly used in financial modeling to simulate the model's performance on unseen data as if it were being used in practice.

In [90]:
split_index = int(len(X) * 0.7)
X_train_bt, X_test_bt = X[:split_index], X[split_index:]
y_train_bt, y_test_bt = y[:split_index], y[split_index:]

rf_random.best_estimator_.fit(X_train_bt, y_train_bt)
predictions = rf_random.best_estimator_.predict(X_test_bt)

mse = mean_squared_error(y_test_bt, predictions)
print("Backtest MSE:", mse)
Backtest MSE: 318962.1721488143

Visualizing Predictions with Plotly¶

Finally, we visualize the actual vs. predicted prices using Plotly, a powerful library for creating interactive charts. This visualization helps in understanding the accuracy of the predictions in a more intuitive and graphical format.

In [91]:
import plotly.graph_objs as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(len(y_test_bt)), y=y_test_bt, mode='lines', name='Actual'))
fig.add_trace(go.Scatter(x=np.arange(len(y_test_bt)), y=predictions, mode='lines', name='Predicted'))
fig.update_layout(title='Actual vs Predicted Prices', xaxis_title='Index', yaxis_title='Price')
fig.show()

Setup and Model Training with Imputation¶

This section initializes machine learning models and applies data imputation to handle missing values in the dataset. We use SimpleImputer to replace missing values with the median of each column. We then train two different models: RandomForestRegressor and LinearRegression, to predict our target variable. The root mean squared error (RMSE) is calculated for each model to evaluate their performance.

In [92]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

models = {
    'RandomForestRegressor': RandomForestRegressor(),
    'LinearRegression': LinearRegression()
}
In [93]:
from sklearn.impute import SimpleImputer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

imputer = SimpleImputer(strategy='median')

X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)



results = {}
importances = {}

for name, model in models.items():
    pipeline = make_pipeline(SimpleImputer(strategy='median'), model)
    pipeline.fit(X_train_imputed, y_train)
    y_pred = pipeline.predict(X_test_imputed)
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    results[name] = rmse

    print(f'{name} RMSE: {rmse:.4f}')

    if hasattr(model, 'feature_importances_'):
        importances[name] = model.feature_importances_

if importances:
    for name, importance in importances.items():
        features = X_train.columns
        importance_df = pd.DataFrame({'Feature': features, 'Importance': importance}).sort_values(by='Importance', ascending=False)
        print(f'\n{name} Feature Importances:')
        print(importance_df)
RandomForestRegressor RMSE: 14.2672
LinearRegression RMSE: 39.4777

RandomForestRegressor Feature Importances:
      Feature  Importance
1  Market Cap    0.999622
2        year    0.000153
0      Volume    0.000147
3       month    0.000041
4         day    0.000036

Display RMSE Results¶

After training the models, we print out the RMSE results for each. This step helps in comparing the effectiveness of the RandomForestRegressor with LinearRegression based on their RMSE values.

In [94]:
rmse_results = {
    'LSTM 50x1': 0.02627,
    'LSTM 50x2': 0.02557,
    'LSTM 50x3': 0.04761,
    'LSTM 100x1': 0.02177,
    'LSTM 100x2': 0.02226,
    'LSTM 100x3': 0.04501,
    'Linear Regression': 39.4777,
    'Ridge Regression': 39.4777,
    'Lasso Regression': 39.4782,
    'Decision Tree': 17.4592,
    'Random Forest': 14.3826,
    'Support Vector Regression (SVR)': 707.9913,
    'Gradient Boosting': 15.2922,
    'RF Cross-Validation': 144.3496,
    'RF Best Params': 378.5620
}

# Printing the RMSE results
print("RMSE Results for Various Models:")
for model, rmse in rmse_results.items():
    print(f'{model}: {rmse:.4f}')
RMSE Results for Various Models:
LSTM 50x1: 0.0263
LSTM 50x2: 0.0256
LSTM 50x3: 0.0476
LSTM 100x1: 0.0218
LSTM 100x2: 0.0223
LSTM 100x3: 0.0450
Linear Regression: 39.4777
Ridge Regression: 39.4777
Lasso Regression: 39.4782
Decision Tree: 17.4592
Random Forest: 14.3826
Support Vector Regression (SVR): 707.9913
Gradient Boosting: 15.2922
RF Cross-Validation: 144.3496
RF Best Params: 378.5620

Model Evaluation Function¶

We also include a function to evaluate different metrics of the model's performance, such as RMSE, Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), R-squared, and accuracy within a certain threshold. This function helps in a comprehensive assessment of model performance.

In [95]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_model(y_true, y_pred):
    print("Shape of y_true before adjustment:", y_true.shape)
    print("Shape of y_pred before adjustment:", y_pred.shape)

    if y_true.shape[0] != y_pred.shape[0]:
        min_len = min(y_true.shape[0], y_pred.shape[0])
        y_true = y_true[:min_len]
        y_pred = y_pred[:min_len]

    if y_pred.ndim > 1:
        y_pred = y_pred.flatten()

    print("Shape of y_true after adjustment:", y_true.shape)
    print("Shape of y_pred after adjustment:", y_pred.shape)

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    r2 = r2_score(y_true, y_pred)

    threshold = 0.05  # 5% threshold
    within_threshold = np.abs((y_true - y_pred) / y_true) <= threshold
    accuracy = np.mean(within_threshold) * 100

    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")
    print(f"MAPE: {mape:.2f}%")
    print(f"R-squared: {r2:.4f}")
    print(f"Accuracy (within {threshold*100}%): {accuracy:.2f}%")

evaluate_model(y_test, test_predict)
Shape of y_true before adjustment: (593,)
Shape of y_pred before adjustment: (591, 1)
Shape of y_true after adjustment: (591,)
Shape of y_pred after adjustment: (591,)
RMSE: 1406.6387
MAE: 1243.4258
MAPE: 8752.02%
R-squared: -0.7480
Accuracy (within 5.0%): 3.72%
In [96]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R-squared: {r2:.4f}")

param_grid = {
    'n_estimators': [100, 200],  
    'max_features': ['sqrt'],  # Changed 'auto' to 'sqrt'
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 4]
}

random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=10, cv=5, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
random_search.fit(X_train, y_train)
print("Best parameters:", random_search.best_params_)

best_rf = random_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))
print(f"Best RMSE: {rmse_best:.4f}")
RMSE: 14.0666
MAE: 6.1979
R-squared: 0.9998
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 10}
Best RMSE: 28.9350

Initial Model Training and Evaluation¶

This section covers the initial setup, training, and evaluation of a RandomForestRegressor. We train the model on the training set and then make predictions on the test set. The model's performance is evaluated using the root mean squared error (RMSE), mean absolute error (MAE), and the R-squared score, which provides an indication of the goodness of fit of the predictions.

In [97]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
import numpy as np

# Prepare the data and split it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
try:
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
except Exception as e:
    print("An error occurred during model training or prediction:")
    print(e)
    raise

# Calculate performance metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Model Performance Metrics:")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"Mean Absolute Error: {mae:.4f}")
print(f"R-squared: {r2:.4f}")

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_features': ['sqrt'],  # Corrected from 'auto' to 'sqrt'
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 4]
}

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=10, cv=5, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
random_search.fit(X_train, y_train)

# Output the best parameters and the best RMSE
print("Best parameters found by RandomizedSearchCV:")
print(random_search.best_params_)

best_rf = random_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))

print(f"Best RMSE from RandomizedSearchCV: {rmse_best:.4f}")
Model Performance Metrics:
Root Mean Squared Error: 14.0666
Mean Absolute Error: 6.1979
R-squared: 0.9998
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters found by RandomizedSearchCV:
{'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None}
Best RMSE from RandomizedSearchCV: 24.2930

Visualization of Actual vs. Predicted Values¶

To visually assess the model's performance, we plot the actual vs. predicted values. This plot helps identify how well the predicted values match the actual values and highlights any potential areas where the model may be underperforming.

In [ ]:
# Plot Actual vs Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_best, alpha=0.75, color='red', edgecolors='b')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted Values')
plt.show()

Re-evaluation with Best Parameters¶

After identifying the best parameters from the RandomizedSearchCV, we retrain the RandomForestRegressor with these optimized parameters and evaluate its performance again using the RMSE metric.

In [98]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf_reg = RandomForestRegressor(
    n_estimators=best_rf.get_params()['n_estimators'],
    max_features=best_rf.get_params()['max_features'],
    max_depth=best_rf.get_params()['max_depth'],
    min_samples_split=best_rf.get_params()['min_samples_split'],
    min_samples_leaf=best_rf.get_params()['min_samples_leaf'],
    random_state=42  
)

rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print(f'Random Forest Regression RMSE: {rmse_rf}')
Random Forest Regression RMSE: 24.292987762035448

References

Python Software Foundation. (2023). Python 3.10.4 documentation. Available at: https://docs.python.org/3/. [Accessed 8 December 2023].

Pandas Development Team. (2023). pandas: powerful Python data analysis toolkit. Available at: https://pandas.pydata.org/pandas-docs/stable/. [Accessed 8 December 2023].

Harris, C.R., Millman, K.J., van der Walt, S.J. et al. (2020). Array programming with NumPy. Available at: https://numpy.org/doc/stable/. [Accessed 8 December 2023].

Hunter, J.D., Dale, D., Firing, E., Droettboom, M. (2023). Matplotlib: Visualization with Python. Available at: https://matplotlib.org/stable/users/index.html. [Accessed 8 December 2023].

Waskom, M.L. (2023). Seaborn: statistical data visualization. Available at: https://seaborn.pydata.org/. [Accessed 8 December 2023].

Virtanen, P., Gommers, R., Oliphant, T.E., et al. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Available at: https://docs.scipy.org/doc/scipy/reference/. [Accessed 8 December 2023].

Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: Machine Learning in Python. Available at: https://scikit-learn.org/stable/. [Accessed 8 December 2023].

Seabold, S., Perktold, J. (2010). Statsmodels: Econometric and Statistical Modeling with Python. Available at: https://www.statsmodels.org/stable/index.html. [Accessed 8 December 2023].